Goto

Collaborating Authors

 robotic and automation


A High-Resolution Dataset for Instance Detection with Multi-View Instance Capture

Neural Information Processing Systems

One major reason is that current InsDet datasets are too small in scale by today's standards. For example, the popular InsDet dataset GMU (published in 2016) has only 23 instances, far less than COCO (80 classes), a well-known object detection dataset published in 2014.




High-Performance Dual-Arm Task and Motion Planning for Tabletop Rearrangement

Zhang, Duo, Huang, Junshan, Yu, Jingjin

arXiv.org Artificial Intelligence

Abstract-- We propose Synchronous Dual-Arm Rearrangement Planner (SDAR), a task and motion planning (T AMP) framework for tabletop rearrangement, where two robot arms equipped with 2-finger grippers must work together in close proximity to rearrange objects whose start and goal configurations are strongly entangled. T o tackle such challenges, SDAR tightly knit together its dependency-driven task planner (SDAR-T) and synchronous dual-arm motion planner (SDAR-M), to intelligently sift through a large number of possible task and motion plans. Specifically, SDAR-T applies a simple yet effective strategy to decompose the global object dependency graph induced by the rearrangement task, to produce more optimal dual-arm task plans than solutions derived from optimal task plans for a single arm. Leveraging state-of-the-art GPU SIMD-based motion planning tools, SDAR-M employs a layered motion planning strategy to sift through many task plans for the best synchronous dual-arm motion plan while ensuring high levels of success rate. Comprehensive evaluation demonstrates that SDAR delivers a 100% success rate in solving complex, non-monotone, long-horizon tabletop rearrangement tasks with solution quality far exceeding the previous state-of-the-art. Experiments on two UR-5e arms further confirm SDAR directly and reliably transfers to robot hardware. Task and motion planning (T AMP) [1] represents a fundamental computation challenge in robotics, in which a robot system, e.g., one or more robot arms, must break down a given, potentially long-horizon task into suitable "bite-sized" sub-tasks that can be executed through short-horizon robot motions.


A New Trajectory-Oriented Approach to Enhancing Comprehensive Crowd Navigation Performance

Zhou, Xinyu, Piao, Songhao, Gao, Chao, Chen, Liguo

arXiv.org Artificial Intelligence

Crowd navigation has garnered considerable research interest in recent years, especially with the proliferating application of deep reinforcement learning (DRL) techniques. Many studies, however, do not sufficiently analyze the relative priorities among evaluation metrics, which compromises the fair assessment of methods with divergent objectives. Furthermore, trajectory-continuity metrics, specifically those requiring $C^2$ smoothness, are rarely incorporated. Current DRL approaches generally prioritize efficiency and proximal comfort, often neglecting trajectory optimization or addressing it only through simplistic, unvalidated smoothness reward. Nevertheless, effective trajectory optimization is essential to ensure naturalness, enhance comfort, and maximize the energy efficiency of any navigation system. To address these gaps, this paper proposes a unified framework that enables the fair and transparent assessment of navigation methods by examining the prioritization and joint evaluation of multiple optimization objectives. We further propose a novel reward-shaping strategy that explicitly emphasizes trajectory-curvature optimization. The resulting trajectory quality and adaptability are significantly enhanced across multi-scale scenarios. Through extensive 2D and 3D experiments, we demonstrate that the proposed method achieves superior performance compared to state-of-the-art approaches.


MagicSkin: Balancing Marker and Markerless Modes in Vision-Based Tactile Sensors with a Translucent Skin

Tijani, Oluwatimilehin, Chen, Zhuo, Deng, Jiankang, Luo, Shan

arXiv.org Artificial Intelligence

Vision-based tactile sensors (VBTS) face a fundamental trade-off in marker and markerless design on the tactile skin: opaque ink markers enable measurement of force and tangential displacement but completely occlude geometric features necessary for object and texture classification, while markerless skin preserves surface details but struggles in measuring tangential displacements effectively. Current practice to solve the above problem via UV lighting or virtual transfer using learning-based models introduces hardware complexity or computing burdens. This paper introduces MagicSkin, a novel tactile skin with translucent, tinted markers balancing the modes of marker and markerless for VBTS. It enables simultaneous tangential displacement tracking, force prediction, and surface detail preservation. This skin is easy to plug into GelSight-family sensors without requiring additional hardware or software tools. We comprehensively evaluate MagicSkin in downstream tasks. The translucent markers impressively enhance rather than degrade sensing performance compared with traditional markerless and inked marker design: it achieves best performance in object classification (99.17\%), texture classification (93.51\%), tangential displacement tracking (97\% point retention) and force prediction (66\% improvement in total force error). These experimental results demonstrate that translucent skin eliminates the traditional performance trade-off in marker or markerless modes, paving the way for multimodal tactile sensing essential in tactile robotics. See videos at this \href{https://zhuochenn.github.io/MagicSkin_project/}{link}.


Perturbation-mitigated USV Navigation with Distributionally Robust Reinforcement Learning

Zhang, Zhaofan, Yang, Minghao, Xie, Sihong, Xiong, Hui

arXiv.org Artificial Intelligence

The robustness of Unmanned Surface Vehicles (USV) is crucial when facing unknown and complex marine environments, especially when heteroscedastic observational noise poses significant challenges to sensor-based navigation tasks. Recently, Distributional Reinforcement Learning (DistRL) has shown promising results in some challenging autonomous navigation tasks without prior environmental information. However, these methods overlook situations where noise patterns vary across different environmental conditions, hindering safe navigation and disrupting the learning of value functions. To address the problem, we propose DRIQN to integrate Distributionally Robust Optimization (DRO) with implicit quantile networks to optimize worst-case performance under natural environmental conditions. Leveraging explicit subgroup modeling in the replay buffer, DRIQN incorporates heterogeneous noise sources and target robustness-critical scenarios. Experimental results based on the risk-sensitive environment demonstrate that DRIQN significantly outperforms state-of-the-art methods, achieving +13.51\% success rate, -12.28\% collision rate and +35.46\% for time saving, +27.99\% for energy saving, compared with the runner-up.


Describe Anything Anywhere At Any Moment

Gorlo, Nicolas, Schmid, Lukas, Carlone, Luca

arXiv.org Artificial Intelligence

Computer vision and robotics applications ranging from augmented reality to robot autonomy in large-scale environments require spatio-temporal memory frameworks that capture both geometric structure for accurate language-grounding as well as semantic detail. Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. DAAAM introduces a novel optimization-based frontend to infer detailed semantic descriptions from localized captioning models, such as the Describe Anything Model (DAM), leveraging batch processing to speed up inference by an order of magnitude for online processing. It leverages such semantic understanding to build a hierarchical 4D scene graph (SG), which acts as an effective globally spatially and temporally consistent memory representation. DAAAM constructs 4D SGs with detailed, geometrically grounded descriptions while maintaining real-time performance. We show that DAAAM's 4D SG interfaces well with a tool-calling agent for inference and reasoning. We thoroughly evaluate DAAAM in the complex task of spatio-temporal question answering on the NaVQA benchmark and show its generalization capabilities for sequential task grounding on the SG3D benchmark. We further curate an extended OC-NaVQA benchmark for large-scale and long-time evaluations. DAAAM achieves state-of-the-art results in both tasks, improving OC-NaVQA question accuracy by 53.6%, position errors by 21.9%, temporal errors by 21.6%, and SG3D task grounding accuracy by 27.8% over the most competitive baselines, respectively. We release our data and code open-source.


Opening the Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer

Xue, Haoru, He, Tairan, Wang, Zi, Ben, Qingwei, Xiao, Wenli, Luo, Zhengyi, Da, Xingye, Castañeda, Fernando, Shi, Guanya, Sastry, Shankar, Fan, Linxi "Jim", Zhu, Yuke

arXiv.org Artificial Intelligence

Recent progress in GPU-accelerated, photorealistic simulation has opened a scalable data-generation path for robot learning, where massive physics and visual randomization allow policies to generalize beyond curated environments. Building on these advances, we develop a teacher-student-bootstrap learning framework for vision-based humanoid loco-manipulation, using articulated-object interaction as a representative high-difficulty benchmark. Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure that mitigates partial observability and improves closed-loop consistency in sim-to-real RL. Trained entirely on simulation data, the resulting policy achieves robust zero-shot performance across diverse door types and outperforms human teleoperators by up to 31.7% in task completion time under the same whole-body control stack. This represents the first humanoid sim-to-real policy capable of diverse articulated loco-manipulation using pure RGB perception.


D-LIO: 6DoF Direct LiDAR-Inertial Odometry based on Simultaneous Truncated Distance Field Mapping

Coto-Elena, Lucia, Maese, J. E., Merino, L., Caballero, F.

arXiv.org Artificial Intelligence

Published in IEEE Robotics and Automation Letters, vol. Abstract-- This paper presents a new approach for 6DoF Direct LiDAR-Inertial Odometry (D-LIO) based on the simultaneous mapping of truncated distance fields on CPU. Such continuous representation (in the vicinity of the points) enables working with raw 3D LiDAR data online, avoiding the need of LiDAR feature selection and tracking, simplifying the odometry pipeline and easily generalizing to many scenarios. The method is based on the proposed Fast Truncated Distance Field (Fast-TDF) method as a convenient tool to represent the environment, employing binary masks that encodes the L1 distance. Such representation enables i) solving the LiDAR point-cloud registration as a nonlinear optimization process without the need of selecting/tracking LiDAR features in the input data, ii) simultaneously producing an accurate truncated distance field map of the environment, and iii) updating such map at constant time independently of its size. The approach is tested using open datasets, aerial and ground. It is also benchmarked against other state-of-the-art odometry approaches, demonstrating the same or better level of accuracy with the added value of an online-generated TDF representation of the environment, that can be used for other robotics tasks as planning or collision avoidance. Accurate vehicle localization is a crucial aspect of robotics, directly influencing autonomous navigation, remote exploration, and other advanced applications. V arious techniques are employed to improve localization, combining data from different sensors such as cameras, inertial measurement units (IMUs), LiDAR and radar [1].